Preserve SpeechLM perception checkpoint dtype by DongjiGao · Pull Request #15686 · NVIDIA-NeMo/NeMo

DongjiGao · 2026-05-11T17:19:10Z

Summary

Stop forcing the SpeechLM vLLM perception module to FP32 when loading checkpoint weights.
Keep raw audio input/preprocessing in FP32, then cast processed features to the encoder checkpoint dtype before encoder execution.
Cast final audio embeddings to the initialized LLM dtype before inserting them into the language model stream.
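
For readers unfamiliar with the precision gap between the two dtypes, here is a minimal stdlib-only sketch of what rounding a value to BF16 does compared with FP32. It is an illustration only and is not taken from the PR diff: the helper names `round_to_fp32` and `round_to_bf16` are hypothetical. Linking this precision gap to the choice to keep raw audio preprocessing in FP32 is an assumption, not something the PR states. The sketch ignores NaN, infinity, and overflow handling.

```python
import struct


def round_to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (round-to-nearest-even).

    BF16 keeps the top 16 bits of the FP32 bit pattern: sign, 8 exponent
    bits, and 7 explicit mantissa bits. Hypothetical helper for illustration;
    NaN/inf/overflow cases are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    sample = 1.001  # a value whose offset from 1.0 is below BF16 resolution near 1.0
    print("fp32:", round_to_fp32(sample))
    print("bf16:", round_to_bf16(sample))
```

Near 1.0 the spacing between adjacent BF16 values is 2^-7, so small differences between neighbouring values that FP32 preserves can collapse to the same BF16 value.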

Test plan

python3 -c "import ast; ast.parse(open('/home/dongjig/NeMo_merge/nemo/collections/speechlm2/vllm/salm/model.py').read()); ast.parse(open('/home/dongjig/NeMo_merge/nemo/collections/speechlm2/modules/perception.py').read())"
VoxPopuli full local vLLM plugin check on NemotronH SpeechLM: WER 9.07, RTFx 814.54; results at /data/dongjig/results/quantization/speechlm_bf16_perception_checkpoint_dtype_voxpopuli_20260511_095950/result.json.
ASR leaderboard dtype comparison completed for LibriSpeech clean/other, TEDLIUM, SPGISpeech, VoxPopuli, GigaSpeech, and the first 1024 Earnings22 samples, with results under /data/dongjig/results/quantization/speechlm_leaderboard_perception_dtype_20260510_150218; no meaningful WER regression observed for BF16 perception.
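
For context on the two metrics quoted above, here is a minimal sketch of how WER and RTFx are commonly computed. It is not the evaluation code used for this PR; the function names `word_error_rate` and `rtfx` are hypothetical, and it assumes no text normalization and single-utterance scoring. Leaderboard-style WER is usually aggregated over a whole test set after normalization and reported as a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {100 * wer:.2f}%")
    print(f"RTFx: {rtfx(3600.0, 4.5):.2f}")
```

Under these definitions a lower WER and a higher RTFx are better, so a dtype change is judged by whether WER stays flat while throughput holds or improves.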


copy-pr-bot · 2026-05-11T17:19:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


pzelasko

Great fix but I think there is too much defensive dtype casting, can we minimize to the absolutely necessary ones only?


pzelasko · 2026-05-11T20:20:12Z

/ok to test 82fa4af

pzelasko · 2026-05-12T15:05:08Z

/ok to test 8653bac

DongjiGao · 2026-05-12T16:08:47Z

/ok to test 83974ea

github-actions · 2026-05-13T00:39:28Z

[🤖]: Hi @DongjiGao 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

Preserve SpeechLM perception checkpoint dtype

1dfc2c5

Avoid forcing the SpeechLM audio perception module to FP32 during vLLM inference so BF16 checkpoints can run the encoder in their stored dtype while keeping raw audio preprocessing in FP32. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

DongjiGao requested a review from pzelasko May 11, 2026 17:22

DongjiGao added 2 commits May 11, 2026 10:34

Keep perception dtype cast local to vLLM plugin

a5a1623

Move the processed-feature dtype cast out of the shared perception module and into the SpeechLM vLLM model path so this fix remains scoped to plugin inference. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

Remove shared perception module change

36c35e2

Keep the dtype conversion scoped to the SpeechLM vLLM plugin path and leave the shared perception module unchanged. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

DongjiGao force-pushed the speechlm-perception-checkpoint-dtype branch from e369a20 to 36c35e2 May 11, 2026 17:40

pzelasko reviewed May 11, 2026


Comment thread nemo/collections/speechlm2/vllm/salm/model.py Outdated

Comment thread nemo/collections/speechlm2/vllm/salm/model.py Outdated

Comment thread nemo/collections/speechlm2/vllm/salm/model.py Outdated

DongjiGao added 3 commits May 11, 2026 11:08

Use explicit preprocessing in SpeechLM vLLM path

3267050

Call the audio preprocessor directly before casting features to the perception encoder dtype, keeping the dtype fix scoped to the plugin inference path. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

Load perception weights without redundant recast

cccc3c2

Keep the perception module in the checkpoint dtype while loading the original tensors directly. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

Keep SpeechLM audio embedding cast unchanged

fd220b6

Preserve the existing BF16 LLM boundary cast and keep the PR focused on avoiding FP32 perception weights. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

pzelasko reviewed May 11, 2026


Comment thread nemo/collections/speechlm2/vllm/salm/model.py Outdated

DongjiGao added 3 commits May 11, 2026 12:07

Use explicit perception dtype for SpeechLM vLLM

4eb6d17

Keep raw audio preprocessing in FP32 and run perception in BF16 for the vLLM plugin path without extra defensive dtype detection. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

Avoid redundant audio embedding dtype cast

311eec8

Perception outputs already follow the plugin perception dtype, so avoid an extra cast before returning audio embeddings. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

Use normal perception forward path

82fa4af

Rely on AudioPerceptionModule to handle preprocessing and encoder handoff after the plugin sets the perception module dtype. Signed-off-by: Dongji Gao <dongjig@nvidia.com>

pzelasko approved these changes May 11, 2026


copy-pr-bot Bot temporarily deployed to test May 11, 2026 20:22 Inactive

pzelasko enabled auto-merge (squash) May 11, 2026 20:32

DongjiGao added 2 commits May 11, 2026 14:12

Merge branch 'main' into speechlm-perception-checkpoint-dtype

29d2934

Merge branch 'main' into speechlm-perception-checkpoint-dtype

8653bac

copy-pr-bot Bot temporarily deployed to public May 12, 2026 15:06 Inactive

copy-pr-bot Bot had a problem deploying to test May 12, 2026 15:07 Error

copy-pr-bot Bot temporarily deployed to public May 12, 2026 15:09 Inactive

copy-pr-bot Bot temporarily deployed to public May 12, 2026 15:10 Inactive

copy-pr-bot Bot temporarily deployed to public May 12, 2026 15:14 Inactive

Merge branch 'main' into speechlm-perception-checkpoint-dtype

83974ea

copy-pr-bot Bot temporarily deployed to public May 12, 2026 16:09 Inactive

copy-pr-bot Bot temporarily deployed to test May 12, 2026 16:11 Inactive

copy-pr-bot Bot temporarily deployed to public May 12, 2026 16:14 Inactive

copy-pr-bot Bot temporarily deployed to public May 12, 2026 16:15 Inactive

copy-pr-bot Bot temporarily deployed to public May 12, 2026 16:18 Inactive

pzelasko merged commit 5a855e7 into NVIDIA-NeMo:main May 13, 2026
153 checks passed

pzelasko merged 12 commits into NVIDIA-NeMo:main from DongjiGao:speechlm-perception-checkpoint-dtype

DongjiGao commented May 11, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 11, 2026

Uh oh!

pzelasko left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pzelasko commented May 11, 2026

Uh oh!

pzelasko commented May 12, 2026

Uh oh!

DongjiGao commented May 12, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DongjiGao commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

copy-pr-bot Bot commented May 11, 2026

Uh oh!

pzelasko left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pzelasko commented May 11, 2026

Uh oh!

pzelasko commented May 12, 2026

Uh oh!

DongjiGao commented May 12, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DongjiGao commented May 11, 2026 •

edited

Loading